The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise and avoid long printouts. Feel free to add in as many new code chunks as you’d like.
Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data.
Note:
Throughout this document, any season column
represents the year each season started. For example, the 2015-16 season
will be in the data set as 2015. For most of the rest of the project, we
will refer to a season by just this number (e.g. 2015) instead of the
full text (e.g. 2015-16).
Answers
Question 1:
Question 2: 81.6%
Question 3: 46.2%
Question 4: This is a written question. Please leave your response in the document under Question 5.
Question 5: 78.9% of games
Question 6:
Question 7:
Please show your work in the document, you don’t need anything here.
Please write your response in the document, you don’t need anything here.
## Function to install and load packages
install_and_load <- function(package_names) {
# Check which packages are not installed
new_packages <- package_names[!(package_names %in% installed.packages()[, "Package"])]
# Install new packages
if(length(new_packages)) {
install.packages(new_packages)
}
# Load all packages
sapply(package_names, require, character.only = TRUE)
}
# Load in packages
packages <- c("tidyverse", "tidymodels", "ggplot2", "RcppRoll", "vip", "doParallel",
"xgboost","lme4", "finetune", "PlayerRatings", "gt","gtExtras")
install_and_load(packages)
## tidyverse tidymodels ggplot2 RcppRoll vip
## TRUE TRUE TRUE TRUE TRUE
## doParallel xgboost lme4 finetune PlayerRatings
## TRUE TRUE TRUE TRUE TRUE
## gt gtExtras
## TRUE TRUE
# Import relevant csv data
player_data <- read_csv("data/player_game_data.csv", show_col_types = FALSE)
team_data <- read_csv("data/team_game_data.csv", show_col_types = FALSE)
In this section, you’re going to work to answer questions using data from both team and player stats. All provided stats are on the game level.
QUESTION: What was the Warriors’ Team offensive and defensive eFG% in the 2015-16 regular season? Remember that this is in the data as the 2015 season.
# Get an understanding of general df structure of team_data
glimpse(team_data)
## Rows: 27,144
## Columns: 41
## $ season <dbl> 2016, 2016, 2021, 2021, 2016, 2016, 2021, 2022, 201…
## $ gametype <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ nbagameid <dbl> 21600495, 21600495, 22100943, 22100943, 21601032, 2…
## $ gamedate <date> 2016-12-30, 2016-12-30, 2022-03-03, 2022-03-03, 20…
## $ offensivenbateamid <dbl> 1610612740, 1610612752, 1610612742, 1610612744, 161…
## $ off_team_name <chr> "New Orleans Pelicans", "New York Knicks", "Dallas …
## $ off_team <chr> "NOP", "NYK", "DAL", "GSW", "CHI", "UTA", "TOR", "M…
## $ off_home <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, …
## $ off_win <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, …
## $ defensivenbateamid <dbl> 1610612752, 1610612740, 1610612744, 1610612742, 161…
## $ def_team_name <chr> "New York Knicks", "New Orleans Pelicans", "Golden …
## $ def_team <chr> "NYK", "NOP", "GSW", "DAL", "UTA", "CHI", "DET", "B…
## $ def_home <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, …
## $ def_win <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, …
## $ fg2made <dbl> 26, 28, 28, 29, 31, 24, 29, 34, 21, 37, 31, 29, 29,…
## $ fg2missed <dbl> 30, 42, 20, 22, 30, 32, 28, 25, 21, 25, 15, 19, 37,…
## $ fg2attempted <dbl> 56, 70, 48, 51, 61, 56, 57, 59, 42, 62, 46, 48, 66,…
## $ fg3made <dbl> 12, 7, 17, 15, 7, 7, 7, 10, 17, 11, 19, 13, 10, 9, …
## $ fg3missed <dbl> 17, 16, 20, 15, 12, 18, 19, 22, 15, 17, 20, 23, 16,…
## $ fg3attempted <dbl> 29, 23, 37, 30, 19, 25, 26, 32, 32, 28, 39, 36, 26,…
## $ fgmade <dbl> 38, 35, 45, 44, 38, 31, 36, 44, 38, 48, 50, 42, 39,…
## $ fgmissed <dbl> 47, 58, 40, 37, 42, 50, 47, 47, 36, 42, 35, 42, 53,…
## $ fgattempted <dbl> 85, 93, 85, 81, 80, 81, 83, 91, 74, 90, 85, 84, 92,…
## $ ftmade <dbl> 16, 15, 15, 10, 12, 17, 27, 11, 24, 7, 20, 21, 14, …
## $ ftmissed <dbl> 1, 1, 5, 3, 2, 8, 9, 4, 5, 3, 5, 6, 6, 7, 2, 4, 1, …
## $ ftattempted <dbl> 17, 16, 20, 13, 14, 25, 36, 15, 29, 10, 25, 27, 20,…
## $ reboffensive <dbl> 6, 15, 12, 12, 10, 20, 18, 14, 9, 13, 5, 12, 22, 10…
## $ rebdefensive <dbl> 42, 43, 30, 27, 33, 34, 33, 36, 31, 29, 32, 33, 34,…
## $ reboundchance <dbl> 48, 58, 42, 39, 43, 54, 51, 50, 40, 42, 37, 45, 56,…
## $ assists <dbl> 22, 18, 29, 26, 24, 15, 12, 25, 28, 35, 29, 22, 21,…
## $ stealsagainst <dbl> 7, 4, 7, 7, 6, 7, 4, 9, 7, 6, 8, 9, 7, 5, 1, 12, 9,…
## $ turnovers <dbl> 13, 14, 10, 15, 11, 15, 10, 17, 14, 12, 11, 16, 16,…
## $ blocksagainst <dbl> 6, 5, 5, 1, 8, 9, 5, 6, 8, 2, 1, 1, 8, 6, 4, 7, 0, …
## $ defensivefouls <dbl> 17, 11, 16, 16, 11, 18, 27, 14, 19, 9, 14, 18, 18, …
## $ offensivefouls <dbl> 1, 1, 0, 3, 0, 1, 2, 2, 2, 1, 0, 2, 3, 0, 1, 2, 2, …
## $ shootingfoulsdrawn <dbl> 9, 9, 11, 6, 6, 12, 14, 9, 12, 5, 9, 12, 10, 11, 12…
## $ possessions <dbl> 101, 99, 90, 90, 88, 88, 92, 101, 93, 95, 103, 102,…
## $ points <dbl> 104, 92, 122, 113, 95, 86, 106, 109, 117, 114, 139,…
## $ shotattempts <dbl> 91, 100, 90, 86, 84, 91, 94, 97, 84, 94, 93, 96, 98…
## $ andones <dbl> 3, 2, 6, 1, 2, 2, 3, 3, 2, 1, 1, 0, 4, 3, 3, 3, 3, …
## $ shotattemptpoints <dbl> 102, 92, 120, 112, 92, 85, 99, 109, 111, 113, 132, …
# Calculate eFG% -> eFG = ((FGM + (0.5 * 3PM)) / FGA
q1_efg_pct <-
# Offensive eFG%
team_data %>%
filter(season == 2015 & off_team == 'GSW' & gametype == 2) %>% # filters for off team
group_by(off_team) %>%
summarise(
off_efg_pct = (sum(fgmade) + (0.5 * sum(fg3made))) / sum(fgattempted) * 100
) %>%
ungroup() %>%
# Defensive eFG%
inner_join (
team_data %>%
filter(season == 2015 & def_team == 'GSW' & gametype == 2) %>% # filters for def team
group_by(def_team) %>%
summarise(
def_efg_pct = (sum(fgmade) + (0.5 * sum(fg3made))) / sum(fgattempted) * 100
) %>%
ungroup(),
by = c("off_team" = "def_team")
)
# Output both values to 1.d.p
cat(
"Offensive:", q1_efg_pct %>% pull(off_efg_pct) %>% round(1) %>% paste0("% eFG"),
"\nDefensive:", q1_efg_pct %>% pull(def_efg_pct) %>% round(1) %>% paste0("% eFG")
)
## Offensive: 56.3% eFG
## Defensive: 47.9% eFG
ANSWER 1:
Offensive: 56.3% eFG
Defensive: 47.9% eFG
QUESTION: What percent of the time does the team with the higher eFG% in a given game win that game? Use games from the 2014-2023 regular seasons. If the two teams have an exactly equal eFG%, remove that game from the calculation.
# Step 1 : Get all regular season games from 2014 to 2023
q2_team_data <- team_data %>%
filter(season >= 2014 & season <= 2023 & gametype == 2)
# Step 2: Calculate eFG% for each team in each game as per Q1
q2_team_data_pg <- q2_team_data %>%
rename(team_name = 'off_team',
is_win = 'off_win') %>%
group_by(nbagameid, team_name, is_win) %>%
summarise(
efg_pct = (sum(fgmade) + 0.5 * sum(fg3made)) / sum(fgattempted) * 100,
.groups = 'drop'
) %>%
ungroup()
# Step 3: Filter out games where the eFG% for both teams are equal
q2_team_data_pg_filtered <- q2_team_data_pg %>%
select(nbagameid, team_name, is_win, efg_pct) %>% # select relevant cols
group_by(nbagameid) %>%
filter(n() == 2) %>% # ensure 2 rows per game
filter(length(unique(efg_pct)) > 1) %>% # filter out games with equal eFG%
ungroup()
# Step 3: Determine the % of games when team with the higher eFG% won the game
q2_efg_win_data <- q2_team_data_pg_filtered %>%
group_by(nbagameid) %>%
# add the higher eFG% value per match
reframe(
team_name,
is_win,
efg_pct,
max_efg_pct = max(efg_pct)
) %>%
group_by(nbagameid) %>%
# keep only matches where team had the higher eFG%
filter(efg_pct == max_efg_pct) %>%
ungroup() %>%
summarise(
total_games = n(),
wins_by_higher_efg = sum(is_win == 1),
) %>%
summarise(proportion = wins_by_higher_efg/total_games * 100)
# Output value to 1.d.p
cat(
q2_efg_win_data %>% pull(proportion) %>% round(1),
"% of games were won by the team who had the higher eFG%."
)
## 81.6 % of games were won by the team who had the higher eFG%.
ANSWER 2:
81.6%
QUESTION: What percent of the time does the team with more offensive rebounds in a given game win that game? Use games from the 2014-2023 regular seasons. If the two teams have an exactly equal number of offensive rebounds, remove that game from the calculation.
# Step 1a: Reuse q2_team_data from Q2
q3_team_data <- q2_team_data
# Step 1b: Filter out games where the off rebounds for both teams are equal
q3_team_data <- q3_team_data %>%
select(nbagameid, off_team, off_win, reboffensive) %>%
rename("is_win" = off_win) %>%
group_by(nbagameid) %>%
filter(n() == 2) %>%
filter(length(unique(reboffensive)) > 1) %>%
mutate(max_off_reb = max(reboffensive)) %>%
ungroup()
# Step 2: Determine the % of games won by team with the higher off reb count
q3_offreb_win_data <- q3_team_data %>%
group_by(nbagameid) %>%
# keep data for team that had the higher off reb count
filter(reboffensive == max_off_reb) %>%
ungroup() %>%
summarise(
total_games = n(),
wins_by_higher_offreb = sum(is_win == 1),
) %>%
summarise(proportion = wins_by_higher_offreb/total_games * 100)
# Output value to 1.d.p
cat(
q3_offreb_win_data %>% pull(proportion) %>% round(1) %>% paste0("%"),
"of games were won by the team who had more Offensive rebounds."
)
## 46.2% of games were won by the team who had more Offensive rebounds.
ANSWER 3:
46.2%
QUESTION: Do you have any theories as to why the answer to question 3 is lower than the answer to question 2? Try to be clear and concise with your answer.
ANSWER 4:
We can expect the proportion of games won by teams who had a higher eFG% than their opponents across the 2014-2023 regular seasons to be higher than the proportion of games won by teams who had a higher offensive rebound count.
First we should define eFG% which is a direct measurement of team’s shooting efficiency accounting for the value of three-pointers.
Teams that have higher eFG%, shoot the ball more efficiently which means they will have a higher point per possession value. Whereas, teams that shoot less efficiently will have a lower eFG%. This is important as teams that have lower eFG% will miss more shots than a team with higher eFG% and as a result have a higher number of ‘chances’ to grab offensive rebounds.
Intuition tells us that a team that scores less efficiently could have more offensive rebounds but will have a lower points per possession (PPP) value than a team that scores more efficiently and has a higher PPP value. By maximising PPP, we can then assume that a team who has a higher eFG% than their opponents will be capable of winning a higher proportion of games than a team that wins the offensive rebounding battle (as long as the sample size of games is great enough).
QUESTION: Look at players who played at least 25% of their possible games in a season and scored at least 25 points per game played. Of those player-seasons, what percent of games were they available for on average? Use games from the 2014-2023 regular seasons.
For example:
# Get an understanding of general df structure of player_data
glimpse(player_data)
## Rows: 434,797
## Columns: 59
## $ nbagameid <dbl> 21700826, 21700826, 21700826, 21700826, 21700…
## $ gamedate <date> 2018-02-10, 2018-02-10, 2018-02-10, 2018-02-…
## $ season <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
## $ gametype <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ nbapersonid <dbl> 1627821, 1626156, 203917, 1626143, 202391, 20…
## $ player_name <chr> "James Webb III", "D'Angelo Russell", "Nik St…
## $ nbateamid <dbl> 1610612751, 1610612751, 1610612751, 161061275…
## $ team <chr> "BKN", "BKN", "BKN", "BKN", "BKN", "BKN", "BK…
## $ team_name <chr> "Brooklyn Nets", "Brooklyn Nets", "Brooklyn N…
## $ opposingnbateamid <dbl> 1610612740, 1610612740, 1610612740, 161061274…
## $ opp_team <chr> "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "NO…
## $ opp_team_name <chr> "New Orleans Pelicans", "New Orleans Pelicans…
## $ starter <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, …
## $ missed <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
## $ seconds <dbl> 418.0, 1895.4, 711.6, 308.0, 0.0, 2572.2, 217…
## $ points <dbl> 0, 21, 5, 6, 0, 24, 12, 12, 10, 0, 0, 0, 0, 0…
## $ fg2made <dbl> 0, 2, 1, 3, 0, 3, 3, 2, 1, 0, 0, 0, 0, 0, 4, …
## $ fg2missed <dbl> 0, 3, 0, 1, 0, 10, 9, 2, 2, 0, 0, 0, 0, 0, 2,…
## $ fg2attempted <dbl> 0, 5, 1, 4, 0, 13, 12, 4, 3, 0, 0, 0, 0, 0, 6…
## $ fg3made <dbl> 0, 5, 1, 0, 0, 2, 1, 2, 2, 0, 0, 0, 0, 0, 0, …
## $ fg3missed <dbl> 1, 8, 1, 0, 0, 7, 3, 2, 5, 0, 0, 0, 0, 0, 0, …
## $ fg3attempted <dbl> 1, 13, 2, 0, 0, 9, 4, 4, 7, 0, 0, 0, 0, 0, 0,…
## $ fgmade <dbl> 0, 7, 2, 3, 0, 5, 4, 4, 3, 0, 0, 0, 0, 0, 4, …
## $ fgmissed <dbl> 1, 11, 1, 1, 0, 17, 12, 4, 7, 0, 0, 0, 0, 0, …
## $ fgattempted <dbl> 1, 18, 3, 4, 0, 22, 16, 8, 10, 0, 0, 0, 0, 0,…
## $ ftmade <dbl> 0, 2, 0, 0, 0, 12, 3, 2, 2, 0, 0, 0, 0, 0, 0,…
## $ ftmissed <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ftattempted <dbl> 0, 2, 0, 0, 0, 12, 4, 2, 2, 0, 0, 0, 0, 0, 0,…
## $ reboffensive <dbl> 0, 2, 0, 0, 0, 1, 5, 2, 1, 0, 0, 0, 0, 0, 3, …
## $ rebdefensive <dbl> 0, 7, 2, 1, 0, 3, 4, 8, 4, 0, 0, 0, 0, 0, 6, …
## $ offensivereboundchances <dbl> 7, 38, 13, 5, 0, 45, 41, 35, 44, 0, 0, 0, 0, …
## $ defensivereboundchances <dbl> 3, 23, 8, 2, 0, 44, 41, 37, 34, 0, 0, 0, 0, 0…
## $ assists <dbl> 1, 5, 1, 0, 0, 10, 5, 3, 2, 0, 0, 0, 0, 0, 0,…
## $ steals <dbl> 0, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 0, 0, 0, 0, …
## $ stealsagainst <dbl> 0, 3, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ turnovers <dbl> 0, 5, 0, 0, 0, 4, 1, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ blocks <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, …
## $ blocksagainst <dbl> 0, 0, 0, 0, 0, 4, 2, 0, 3, 0, 0, 0, 0, 0, 0, …
## $ defensivefouls <dbl> 1, 3, 0, 0, 0, 4, 5, 6, 4, 0, 0, 0, 0, 0, 0, …
## $ defensivefoulsdrawn <dbl> 0, 4, 0, 0, 0, 6, 2, 1, 2, 0, 0, 0, 0, 0, 0, …
## $ offensivefouls <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ offensivefoulsdrawn <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shootingfouls <dbl> 1, 1, 0, 0, 0, 1, 3, 3, 2, 0, 0, 0, 0, 0, 0, …
## $ shootingfoulsdrawn <dbl> 0, 0, 0, 0, 0, 6, 2, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ shotattempts <dbl> 1, 18, 3, 4, 0, 27, 18, 9, 10, 0, 0, 0, 0, 0,…
## $ shotattemptpoints <dbl> 0, 19, 5, 6, 0, 24, 12, 12, 8, 0, 0, 0, 0, 0,…
## $ offensiveseconds <dbl> 219.7, 988.1, 355.2, 171.0, 0.0, 1390.8, 1234…
## $ offensivepossessions <dbl> 15.00000, 71.00000, 24.00000, 11.00000, 0.000…
## $ defensiveseconds <dbl> 198.3, 907.3, 356.4, 137.0, 0.0, 1181.4, 944.…
## $ defensivepossessions <dbl> 15, 72, 23, 10, 0, 89, 73, 73, 84, 0, 0, 0, 0…
## $ andones <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ teampoints <dbl> 17, 81, 21, 11, 0, 85, 70, 78, 87, 0, 0, 0, 0…
## $ opponentteampoints <dbl> 26, 84, 33, 18, 0, 100, 82, 74, 88, 0, 0, 0, …
## $ teamshotattempts <dbl> 14, 71, 23, 10, 0, 86, 76, 72, 84, 0, 0, 0, 0…
## $ teamfgmade <dbl> 6, 26, 8, 4, 0, 28, 22, 29, 28, 0, 0, 0, 0, 0…
## $ teamfgattempted <dbl> 13, 66, 22, 9, 0, 79, 70, 70, 79, 0, 0, 0, 0,…
## $ teamturnovers <dbl> 1, 8, 2, 1, 0, 12, 10, 13, 13, 0, 0, 0, 0, 0,…
## $ opponentteamfg2attempted <dbl> 9, 40, 16, 8, 0, 62, 57, 51, 48, 0, 0, 0, 0, …
## $ opponentteamfg3attempted <dbl> 4, 12, 5, 2, 0, 19, 15, 13, 16, 0, 0, 0, 0, 0…
# Get all player data from 2014-2023 regular season games
q5_player_data <- player_data %>%
filter(season >= 2014 & season <= 2023 & gametype == 2)
# Get the avg # of players who scored 25ppg and played at least 25% of games across all seasons
q5_25ppg_25_pct_players <- q5_player_data %>%
group_by(nbapersonid, player_name, season) %>%
summarise (
# total games played
gp = n(),
# points per game
ppg = sum(points)/gp,
# 25% of games played in a season of 82 games
gp_25_pct = 82*0.25,
# flag for if a player played at least 25% of total games
has_played_25_pct = ifelse(gp > gp_25_pct, TRUE, FALSE),
.groups = 'drop'
) %>%
# filter for 25 ppg scorers who played at least 25% of total games
filter(ppg >= 25.0 & has_played_25_pct == TRUE) %>%
arrange(nbapersonid, player_name, season) %>% # sort by player and season
summarise(avg_gp = mean(gp)) # calculate avg
# Output value
cat(
"Of the players who scored at least 25 ppg and played at least 25% of their possible games in season, they were available for",
q5_25ppg_25_pct_players %>% pull(avg_gp) %>% round(1),
"% of games on average."
)
## Of the players who scored at least 25 ppg and played at least 25% of their possible games in season, they were available for 78.9 % of games on average.
ANSWER 5:
78.9% of games
QUESTION: What % of playoff series are won by the team with home court advantage? Give your answer by round. Use playoffs series from the 2014-2022 seasons. Remember that the 2023 playoffs took place during the 2022 season (i.e. 2022-23 season).
# Step 1: Get all team data from 2014-2022 playoff games
q6_team_data <- team_data %>%
filter(season >= 2014 & season <= 2022 & gametype == 4) %>%
arrange(season,gamedate, nbagameid) %>%
select(season:def_win)
# Step 2: Transform into series data identifying round name, team with home adv and series winner
q6_series_data <- q6_team_data %>%
# add a unique identifier for a series
mutate(
series_id = paste(season, offensivenbateamid, defensivenbateamid, sep = "-")
) %>%
group_by(series_id) %>%
# add in win count by team, a flag for series outcome and series end
reframe(
season,
nbagameid,
offensivenbateamid,
off_team,
def_team,
is_home = off_home,
game_number = row_number(),
win_count_off = cumsum(off_win),
win_count_def = cumsum(def_win),
series_outcome = paste0(win_count_off, "-", win_count_def),
series_end = ifelse(win_count_off == 4 | win_count_def == 4, 1, 0)
) %>%
# get only 1 team per game id
group_by(nbagameid) %>%
arrange(offensivenbateamid) %>%
filter(min_rank(offensivenbateamid) == 1) %>%
ungroup() %>%
# add in a cumulative value for series end
arrange(nbagameid) %>%
group_by(season) %>%
mutate(cum_series_end = cumsum(series_end)) %>%
ungroup() %>%
group_by(series_id) %>%
# add in round name based on cumulative series end
mutate(
round_name = case_when(
max(cum_series_end) <= 8 ~ "Round 1",
max(cum_series_end) > 8 & max(cum_series_end) <= 12 ~ "Round 2",
max(cum_series_end) > 12 & max(cum_series_end) <= 14 ~ "Conference Finals",
max(cum_series_end) == 15 ~ "Finals",
TRUE ~ NA_character_)
) %>%
ungroup() %>%
group_by(series_id) %>%
# add in team who had home adv and who won the series by row
mutate(
team_w_home_ad = first(case_when(
game_number == 1 & is_home == 1 ~ off_team,
game_number == 1 & is_home == 0 ~ def_team,
TRUE ~ NA_character_
)),
team_series_won = last(case_when(
win_count_off == 4 ~ off_team,
win_count_def == 4 ~ def_team,
TRUE ~ NA_character_
))
) %>%
ungroup()
# Ordered playoff rounds
playoff_rnd_order <- c('Round 1', 'Round 2', 'Conference Finals', 'Finals')
# Get % of playoff series won by the team with home court advantage
q6_series_win_data <- q6_series_data %>%
filter(series_end == 1) %>%
mutate(round_name = factor(round_name, levels = playoff_rnd_order)) %>%
group_by(round_name) %>%
summarise(
home_ad_series_wins = sum(ifelse(team_w_home_ad == team_series_won,1,0)),
total_series = n(),
pct_won_home_ad = home_ad_series_wins/total_series * 100
) %>%
arrange(round_name)
cat(
"Round 1:", q6_series_win_data %>% pull(pct_won_home_ad) %>% .[1] %>% round(1) %>% paste0("%"),
"\nRound 2:", q6_series_win_data %>% pull(pct_won_home_ad) %>% .[2] %>% round(1) %>% paste0("%"),
"\nConference Finals:", q6_series_win_data %>% pull(pct_won_home_ad) %>% .[3] %>% round(1) %>% paste0("%"),
"\nFinals:", q6_series_win_data %>% pull(pct_won_home_ad) %>% .[4] %>% round(1) %>% paste0("%")
)
## Round 1: 84.7%
## Round 2: 63.9%
## Conference Finals: 55.6%
## Finals: 77.8%
ANSWER 6:
Round 1: 84.7%
Round 2: 63.9%
Conference Finals: 55.6%
Finals: 77.8%
QUESTION: Among teams that had at least a +5.0 net rating in the regular season, what percent of them made the second round of the playoffs the following year? Among those teams, what percent of their top 5 total minutes played players (regular season) in the +5.0 net rating season played in that 2nd round playoffs series? Use the 2014-2021 regular seasons to determine the +5 teams and the 2015-2022 seasons of playoffs data.
For example, the Thunder had a better than +5 net rating in the 2023 season. If we make the 2nd round of the playoffs next season (2024-25), we would qualify for this question. Our top 5 minutes played players this season were Shai Gilgeous-Alexander, Chet Holmgren, Luguentz Dort, Jalen Williams, and Josh Giddey. If three of them play in a hypothetical 2nd round series next season, it would count as 3/5 for this question.
Hint: The definition for net rating is in the data dictionary.
# Question 7a
# Get Regular season teams that had >= +5 rating by season
## Step 1: Get team data correctly sorted and filter for only 2014-2021 regular seasons games
q7_team_data_reg <- team_data %>%
filter(season >= 2014 & season <= 2021 & gametype == 2) %>%
arrange(season, gamedate, nbagameid)
## Step 2: Get the >5 net rating teams by season
q7_team_ratings <- q7_team_data_reg %>%
# Offensive ratings
select(season, gamedate, nbagameid, off_team, offensivenbateamid, points, possessions) %>%
group_by(season, offensivenbateamid, off_team) %>%
reframe(
ortg = sum(points)/(sum(possessions)/100),
.groups = 'drop'
) %>%
rename(
team_name = 'off_team',
nba_team_id = 'offensivenbateamid'
) %>%
# Defensive ratings
inner_join(
q7_team_data_reg %>%
select(season, gamedate, nbagameid, def_team, defensivenbateamid, points, possessions) %>%
group_by(season, defensivenbateamid, def_team) %>%
summarise(
drtg = sum(points)/(sum(possessions)/100),
.groups = 'drop'
) %>%
ungroup() %>%
rename(
team_name = 'def_team',
nba_team_id = 'defensivenbateamid'
),
by = c("season", "team_name", "nba_team_id")
) %>%
mutate(
net_rt = ortg - drtg
) %>%
filter(net_rt >= 5) %>%
select(season, nba_team_id)
# Get the teams that made second round of playoffs by season
## Step 1: Get team data for 2015-2022 playoff games
q7_team_data_playoffs <- team_data %>%
filter(season >= 2015 & season <= 2022 & gametype == 4) %>%
arrange(season, gamedate, nbagameid)
## Step 2: Get the teams by season that made second round of playoffs
q7_teams_in_second_rnd <- q7_team_data_playoffs %>%
# add a unique identifier for a series
mutate(
series_id = paste(season, offensivenbateamid, defensivenbateamid, sep = "-")
) %>%
group_by(series_id) %>%
# add in win count by team, a flag for series outcome and series end
reframe(
season,
nbagameid,
offensivenbateamid,
defensivenbateamid,
off_team,
def_team,
is_home = off_home,
game_number = row_number(),
win_count_off = cumsum(off_win),
win_count_def = cumsum(def_win),
series_end = ifelse(win_count_off == 4 | win_count_def == 4, 1, 0)
) %>%
# get only 1 team per game id
group_by(nbagameid) %>%
arrange(offensivenbateamid) %>%
filter(min_rank(offensivenbateamid) == 1) %>%
ungroup() %>%
# add in a cumulative value for series end
arrange(nbagameid) %>%
group_by(season) %>%
mutate(cum_series_end = cumsum(series_end)) %>%
ungroup() %>%
group_by(series_id) %>%
# add in round name based on cumulative series end
mutate(
round_name = case_when(
max(cum_series_end) <= 8 ~ "Round 1",
max(cum_series_end) > 8 & max(cum_series_end) <= 12 ~ "Round 2",
max(cum_series_end) > 12 & max(cum_series_end) <= 14 ~ "Conference Finals",
max(cum_series_end) == 15 ~ "Finals",
TRUE ~ NA_character_)
) %>%
ungroup() %>%
# filter for Round 2 games only
filter(round_name == 'Round 2') %>%
select(season, offensivenbateamid,defensivenbateamid) %>%
group_by(season) %>%
# get list of unique nba team ids
summarise(nba_team_id = list(unique(c(offensivenbateamid, defensivenbateamid)))) %>%
ungroup() %>%
# unnest team ids
unnest(nba_team_id)
# Get % of +5 rating regular season teams that made 2nd round of playoffs in following year
## Step 1: Reuse +5 net rating teams in season prior
# q7_team_ratings
## Step 2: Get total number of teams that had +5 net rating in season prior
q7_num_teams_rating <- nrow(q7_team_ratings)
## Step 3: Proportion of teams with +5 net rating that go on to second round of playoffs
q7_num_teams_rating_round2 <- q7_team_ratings %>%
mutate(season = season + 1) %>% # Adjust the season in the team ratings to match following season
merge(q7_teams_in_second_rnd, by = c("season", "nba_team_id")) # Merge on season and team_id
cat(
round(nrow(q7_num_teams_rating_round2) / q7_num_teams_rating*100,1),
"% of teams who had at least a +5 net rating in the regular season made it to the second round of playoffs in the following season."
)
## 63.6 % of teams who had at least a +5 net rating in the regular season made it to the second round of playoffs in the following season.
# Question 7b
# Get the players that were on the +5 teams teams that made second of playoffs
## Step 1: Reuse the teams that had +5 net regular season rating in season prior
# q7_team_ratings
## Step 2: Get the players from the teams that had a +5 net regular season rating
q7_players_on_plus_five_teams <- player_data %>%
filter(season >= 2014 & season <= 2021 & gametype == 2) %>%
arrange(season, gamedate, nbagameid) %>%
left_join(q7_team_ratings, by = c("season", "nbateamid" = "nba_team_id")) %>%
select(season, nbapersonid, player_name, nbateamid, team, seconds)
## Step 3: Get the top 5 players in minutes played by team by season
q7_top_five_players_on_plus_five_teams <- q7_players_on_plus_five_teams %>%
group_by(season, nbapersonid, player_name, nbateamid, team) %>%
summarise(total_minutes_played = sum(seconds)/60, .groups = 'drop') %>%
ungroup() %>%
group_by(season, nbateamid, team) %>%
arrange(desc(total_minutes_played)) %>%
slice_head(n = 5) %>%
reframe(nbapersonid = nbapersonid)
## Step 4: Reuse the teams by season that made second round of playoffs
glimpse(q7_teams_in_second_rnd)
## Rows: 64
## Columns: 2
## $ season <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016…
## $ nba_team_id <dbl> 1610612737, 1610612748, 1610612744, 1610612759, 1610612739…
## Step 5: Get the players that were on the teams that made second of playoffs
q7_players_playoffs <- player_data %>%
filter(season >= 2015 & season <= 2022 & gametype == 4) %>%
arrange(season, gamedate, nbagameid) %>%
distinct(season, nbateamid, nbapersonid) %>%
left_join(q7_teams_in_second_rnd, by = c("season", "nbateamid" = "nba_team_id")) %>%
select(season, nbateamid, nbapersonid)
# Find by season by team, the # of players from top 5 mp actually played in following season of playoffs
## Step 1: Increment the season in q7_top_five_players_on_plus_five_teams by 1
q7_top_five_players_next_season <- q7_top_five_players_on_plus_five_teams %>%
select(-team) %>%
mutate(season = season + 1)
## Step 2: Join q7_top_five_players_next_season with q7_players_playoffs
q7_players_in_playoffs <- q7_top_five_players_next_season %>%
inner_join(q7_players_playoffs, by = c("season", "nbateamid", "nbapersonid"))
## Step 3: Join the resulting df back to q7_top_five_players_next_season
q7_top_five_players_with_flag <- q7_top_five_players_next_season %>%
left_join(q7_players_in_playoffs %>%
select(season, nbateamid, nbapersonid) %>%
mutate(in_playoffs = TRUE),
by = c("season", "nbateamid", "nbapersonid")) %>%
mutate(in_playoffs = if_else(is.na(in_playoffs), FALSE, TRUE))
## Step 4: Calculate the % of top 5 players who played in the second round of playoffs by team by season
q7_proportion_top5_playoffs <- q7_top_five_players_with_flag %>%
summarise(
total_top5_players = n(),
top5_in_playoffs = sum(in_playoffs)
) %>%
summarise(proportion_in_playoffs = top5_in_playoffs / total_top5_players * 100)
cat(
"Among those teams,",
q7_proportion_top5_playoffs %>% pull(proportion_in_playoffs) %>% round(1) %>% paste0("%"),
"of their top 5 total minutes played players (regular season) in the +5.0 net rating season played in that 2nd round playoffs series."
)
## Among those teams, 37.8% of their top 5 total minutes played players (regular season) in the +5.0 net rating season played in that 2nd round playoffs series.
cat(
"Percent of +5.0 net rating teams making the 2nd round next year:",
round(nrow(q7_num_teams_rating_round2) / q7_num_teams_rating*100,1) %>% paste0("%"),
"\nPercent of top 5 minutes played players who played in those 2nd round series:",
q7_proportion_top5_playoffs %>% pull(proportion_in_playoffs) %>% round(1) %>% paste0("%")
)
## Percent of +5.0 net rating teams making the 2nd round next year: 63.6%
## Percent of top 5 minutes played players who played in those 2nd round series: 37.8%
ANSWER 7:
Percent of +5.0 net rating teams making the 2nd round next year:
63.6%
Percent of top 5 minutes played players who played in those 2nd round
series: 37.8%
For this part, you will work to fit a model that predicts the winner and the number of games in a playoffs series between any given two teams.
This is an intentionally open ended question, and there are multiple approaches you could take. Here are a few notes and specifications:
Your final output must include the probability of each team winning the series. For example: “Team A has a 30% chance to win and team B has a 70% chance.” instead of “Team B will win.” You must also predict the number of games in the series. This can be probabilistic or a point estimate.
You may use any data provided in this project, but please do not bring in any external sources of data.
You can only use data available prior to the start of the series. For example, you can’t use a team’s stats from the 2016-17 season to predict a playoffs series from the 2015-16 season.
The best models are explainable and lead to actionable insights around team and roster construction. We’re more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery.
Include, as part of your answer:
To predict the outcome of a series between two teams in the NBA Playoffs (2023-2024 season), you can approach this problem either at a game or series level which have different strengths and weaknesses in their approach.
I have chosen to build a model on the game-level using a powerful model called XGBoost to predict each game of a 7-game series in order to predict the outcome of a series between any two playoff teams in a given playoff round from the 2024 playoffs.
This model uses certain inputs including box score statistics that take a player and/or teams averages over the past 5 games throughout the regular season (known as rolling averages), as well as indicators of fatigue and momentum such as days since last played and days until next game. I have also incorporated team ratings that assesses the strength (both defensive and offensive) of a team accounting for variations and fluctuations between- and within-seasons.
By using these inputs, a user can predict the outcome of a NBA playoff series from the 2023-24 season and can also look at the most likely path that a team would take in the playoffs including advancements and eliminations.
By taking a game-level approach, the model can only predict a playoff series as the sum of the individual games within that series. Predicting at the game level has a distinct advantage over predicting at the series level as it allows us to account for the impact of chance and other related factors to occur across the 7 games as players and team can and do under/over perform. However, by taking this approach the model is not capturing season-level specific-metrics such as league standings, home and away season records directly that might help predict the outcome of a playoff series in one season.
Additionally, this model assumes that each game in a given series are independent (not the same and are unrelated). This would normally be disadvantageous as intuition tells us that in a 7 game series momentum matters and that teams who go up 3-0 have never lost a series. However, by assuming series games are independent, this allows the model to be abstracted beyond the 2023-24 NBA playoffs- i.e. the model can be used to predict games or series given updated game data from both the past and in the future.
The last caveat of using this model is that the skill rating model used to calculate team strength was not directly optimised and therefore may bias dominant teams where the natural decay in rating when teams lose may not occur as quickly for teams who have won in previous seasons and then had large reductions in team performance the next.
To address these weaknesses the following could be implemented given
more time and/or data: - Tuning the parameters for the team strength
model (as it is the strongest predictor of game outcome in our model). -
Adding in a specific seasonal component prior so that rolling averages
are computer within-season rather than between games across seasons and
league/home/away records. - Greater exploration into metrics relating to
player and team health given its impact on the 2023-24 NBA playoffs thus
far
accounting for players who not only don’t play but are injured. -
Exploring different combinations or ensemble models for game outcomes. -
If given more data, building a series-level predictor model using a
greater sample of historical playoff series outcomes.
# Helper functions ---------------------------
# Function to count games in the last n days
count_games_last_n_days <- function(dates, n_days) {
sapply(1:length(dates), function(i) {
if (i == 1) {
NA
} else {
sum(dates[i] - dates[1:(i-1)] <= n_days)
}
})
}
# Function to create playoff bracket including 1st Round pre-fill and blank entries for rest of playoffs
create_round_bracket <- function(initial_matchups, round_name, playoff_team_seed) {
round_num <- round_name
# Helper function to expand series into games based on home advantage pattern
expand_series <- function(h_team, a_team, conf_name, round_num) {
# Define the home team pattern based on the game number
home_pattern <- c(h_team, h_team, a_team, a_team, h_team, a_team, h_team)
away_pattern <- c(a_team, a_team, h_team, h_team, a_team, h_team, a_team)
tibble(
conference = ifelse(round_num == 4, "Both", conf_name),
round_number = round_num,
round_name = c("Round 1", "Round 2", "Conference Finals", "Finals")[round_num],
game_number = 1:7,
h_team = home_pattern,
a_team = away_pattern
)
}
# Mapping round names to numbers
round_map <- c("Round 1" = 1, "Round 2" = 2, "Conference Finals" = 3, "Finals" = 4)
round_num <- round_map[[round_name]]
# Generate the data frame for the specified round from the initial match ups
round_bracket <- bind_rows(
lapply(names(initial_matchups), function(conf_name) {
bind_rows(
lapply(names(initial_matchups[[conf_name]]), function(h_team) {
a_team <- initial_matchups[[conf_name]][[h_team]]
expand_series(h_team, a_team, conf_name, round_num)
})
)
})
)
if (round_name == 4) {
round_bracket <- round_bracket %>%
mutate(
season = 2023,
nbagameid = row_number(),
gamedate = as.Date('2024-12-31')
) %>%
left_join(
playoff_team_seed %>% select(team_name,league_seed),
by = c("h_team" = "team_name")
) %>%
rename("h_seed" = "league_seed") %>%
left_join(
playoff_team_seed %>% select(team_name,league_seed),
by = c("a_team" = "team_name")
) %>%
rename("a_seed" = "league_seed")
} else {
round_bracket <- round_bracket %>%
mutate(
season = 2023,
nbagameid = row_number(),
gamedate = as.Date('2024-12-31')
) %>%
left_join(
playoff_team_seed,
by = c("h_team" = "team_name")
) %>%
rename("h_seed" = "seed") %>%
left_join(
playoff_team_seed,
by = c("a_team" = "team_name")
) %>%
rename("a_seed" = "seed")
}
return(round_bracket)
}
# Function to get features for current bracket
bracket_with_features <- function(bracket, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) {
bracket_features <- bracket %>%
inner_join(
most_recent_ratings,
by = c("h_team" = "team")
) %>%
rename("h_rating" = rating) %>%
inner_join(
most_recent_ratings,
by = c("a_team" = "team")
) %>%
rename("a_rating" = rating) %>%
inner_join(
most_recent_rolling_features,
by = c("h_team" = "team")
) %>%
rename_with(~ paste0("h_", .), fg2made:ft_rate) %>%
inner_join(
most_recent_rolling_features,
by = c("a_team" = "team")
) %>%
rename_with(~ paste0("a_", .), fg2made:ft_rate) %>%
inner_join(
most_recent_player_features,
by = c("h_team" = "team")
) %>%
rename_with(~ paste0("h_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
inner_join(
most_recent_player_features,
by = c("a_team" = "team")
) %>%
rename_with(~ paste0("a_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
mutate(
diff_rating = h_rating - a_rating,
diff_fg2made = h_fg2made - a_fg2made,
diff_fg2missed = h_fg2missed - a_fg2missed,
diff_fg2attempted = h_fg2attempted - a_fg2attempted,
diff_fg3made = h_fg3made - a_fg3made,
diff_fg3missed = h_fg3missed - a_fg3missed,
diff_fg3attempted = h_fg3attempted - a_fg3attempted,
diff_fgmade = h_fgmade - a_fgmade,
diff_fgmissed = h_fgmissed - a_fgmissed,
diff_fgattempted = h_fgattempted - a_fgattempted,
diff_ftmade = h_ftmade - a_ftmade,
diff_ftmissed = h_ftmissed - a_ftmissed,
diff_ftattempted = h_ftattempted - a_ftattempted,
diff_reboffensive = h_reboffensive - a_reboffensive,
diff_rebdefensive = h_rebdefensive - a_rebdefensive,
diff_reboundchance = h_reboundchance - a_reboundchance,
diff_assists = h_assists - a_assists,
diff_stealsagainst = h_stealsagainst - a_stealsagainst,
diff_turnovers = h_turnovers - a_turnovers,
diff_blocksagainst = h_blocksagainst - a_blocksagainst,
diff_defensivefouls = h_defensivefouls - a_defensivefouls,
diff_offensivefouls = h_offensivefouls - a_offensivefouls,
diff_shootingfoulsdrawn = h_shootingfoulsdrawn - a_shootingfoulsdrawn,
diff_possessions = h_possessions - a_possessions,
diff_points = h_points - a_points,
diff_shotattempts = h_shotattempts - a_shotattempts,
diff_andones = h_andones - a_andones,
diff_shotattemptpoints = h_shotattemptpoints - a_shotattemptpoints,
diff_ppa = h_ppa - a_ppa,
diff_ppp = h_ppp - a_ppp,
diff_tov_pct = h_tov_pct - a_tov_pct,
diff_blk_pct = h_blk_pct - a_blk_pct,
diff_ortg = h_ortg - a_ortg,
diff_drtg = h_drtg - a_drtg,
diff_ntrg = h_ntrg - a_ntrg,
diff_efg_pct = h_efg_pct - a_efg_pct,
diff_ts_pct = h_ts_pct - a_ts_pct,
diff_ft_rate = h_ft_rate - a_ft_rate,
diff_mean_oreb_pct = h_mean_oreb_pct - a_mean_oreb_pct,
diff_mean_dreb_pct = h_mean_dreb_pct - a_mean_dreb_pct,
diff_mean_tov_pct = h_mean_tov_pct - a_mean_tov_pct,
diff_mean_stl_pct = h_mean_stl_pct - a_mean_stl_pct,
diff_mean_blk_pct = h_mean_blk_pct - a_mean_blk_pct,
diff_mean_usg_pct = h_mean_usg_pct - a_mean_usg_pct,
diff_mean_ast_pct = h_mean_ast_pct - a_mean_ast_pct,
diff_max_usg_pct = h_max_usg_pct - a_max_usg_pct,
diff_avg_mp_starter = h_avg_mp_starter - a_avg_mp_starter,
diff_avg_mp_bench = h_avg_mp_bench - a_avg_mp_bench,
diff_pnts_by_starters = h_pnts_by_starters - a_pnts_by_starters,
diff_pnts_by_bench = h_pnts_by_bench - a_pnts_by_bench,
diff_sharp_shooters = h_sharp_shooters - a_sharp_shooters,
diff_paint_specialists = h_paint_specialists - a_paint_specialists,
diff_game_score_metric = h_game_score_metric - a_game_score_metric,
) %>%
select(
conference:a_seed,
starts_with("diff_"),
h_cumulative_unique_lineups,
a_cumulative_unique_lineups
)
return(bracket_features)
}
# Function for Conference based playoffs rounds
run_series <- function(bracket_features,xgb_last) {
# Create bracket checker
bracket_checker <- bracket_features %>%
select(conference:seed_id) %>%
distinct(conference,round_name,round_number, seed_id) %>%
mutate(
winner = NA,
loser = NA,
total_games = NA
)
# Create distinct conference, seed groups
initial_seed_ids_df <- bracket_features %>%
distinct(conference,seed_id)
conference_groups <- split(initial_seed_ids_df, initial_seed_ids_df$conference)
# Loop through each conference
for(conf in names(conference_groups)) {
# Extract the current conference data frame
conference_data <- conference_groups[[conf]]
for(seed in conference_data$seed_id) {
# Filter rows that match the current series_id
indv_series <- bracket_features %>%
filter(seed_id == seed & conference == conf)
u_seed <- indv_series %>% slice_min(order_by = h_seed, n = 1) %>% distinct(h_team) %>% pull(h_team)
b_seed <- indv_series %>% slice_max(order_by = h_seed, n = 1) %>% distinct(h_team) %>% pull(h_team)
if (length(u_seed) > 1) {
b_seed <- u_seed[2]
u_seed <- u_seed[1]
}
u_seed_wins = 0
u_seed_losses = 0
b_seed_wins = 0
b_seed_losses = 0
for (row_n in 1:nrow(indv_series)) {
indiv_game = indv_series %>%
filter(game_number == row_n)
home_team = indiv_game$h_team
away_team = indiv_game$a_team
pred_winner = predict(
xgb_last %>% extract_workflow(),
new_data = indiv_game,
type = "prob",
)
is_home_win = sample(x = c(1, 0), size = 1, replace = TRUE,
prob = c(pred_winner$.pred_1, pred_winner$.pred_0))
# Updating the series_tracker based on the game outcome
if (is_home_win == 1) {
if (home_team == u_seed) {
# Upper seed wins
u_seed_wins <- u_seed_wins + 1
b_seed_losses <- b_seed_losses + 1
} else {
# Lower seed wins
b_seed_wins <- b_seed_wins + 1
u_seed_losses <- u_seed_losses + 1
}
} else {
if (home_team == u_seed) {
# Upper seed loses
b_seed_wins <- b_seed_wins + 1
u_seed_losses <- u_seed_losses + 1
} else {
# Lower seed loses
u_seed_wins <- u_seed_wins + 1
b_seed_losses <- b_seed_losses + 1
}
}
# Check if either team has won 4 games
if (u_seed_wins >= 4 || b_seed_wins >= 4) {
if (u_seed_wins >= 4) {
# If Upper seed wins they advance
total_games <- u_seed_wins + u_seed_losses
bracket_checker$winner[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- u_seed
bracket_checker$loser[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- b_seed
bracket_checker$total_games[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- total_games
} else{
# If Lower seed wins they advance
total_games <- b_seed_wins + b_seed_losses
bracket_checker$winner[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- b_seed
bracket_checker$loser[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- u_seed
bracket_checker$total_games[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- total_games
}
break # Exit the loop
}
}
}
}
return(bracket_checker)
}
# Function to ensure that the higher seeded team starts as home team in case of upset
align_bracket_seeding <- function(bracket_with_features) {
# Determine which match ups need swapping based on the first game
swap_teams <- bracket_with_features %>%
filter(game_number == 1) %>%
mutate(need_swap = h_seed > a_seed) %>%
select(matchup_id, need_swap)
# Join this back to the original bracket_with_features
bracket_with_features <- bracket_with_features %>%
left_join(swap_teams, by = "matchup_id")
bracket_with_features_aligned <- bracket_with_features %>%
mutate(
# Swap teams
h_team_fixed = ifelse(need_swap, a_team, h_team),
a_team_fixed = ifelse(need_swap, h_team, a_team),
# Swap seeds
h_seed_fixed = ifelse(need_swap, a_seed, h_seed),
a_seed_fixed = ifelse(need_swap, h_seed, a_seed)
) %>%
select(-c(need_swap,a_team, h_team, a_team, h_seed, a_seed)) %>%
rename(
"h_team" = h_team_fixed,
"a_team" = a_team_fixed,
"h_seed" = h_seed_fixed,
"a_seed" = a_seed_fixed
) %>%
select(
conference:game_number,
h_team,a_team,
season:gamedate,
h_seed,a_seed
)
return(bracket_with_features_aligned)
}
# Function to get all combinations of ECF or WCF conferences
get_cf_potential_matchups <- function() {
# Define potential winners in the upper and lower brackets
upper_bracket_winners <- c(1, 4, 5, 8)
lower_bracket_winners <- c(2, 3, 6, 7)
# Generate all combinations of these winners for the conference finals
conference_finals_combinations <- expand.grid(upper_bracket = upper_bracket_winners,
lower_bracket = lower_bracket_winners) %>%
# Ensure the format "higher seed-lower seed"
mutate(Conference_Final_Matchup = ifelse(upper_bracket < lower_bracket,
paste(upper_bracket, lower_bracket, sep = "-"),
paste(lower_bracket, upper_bracket, sep = "-"))) %>%
# Remove duplicates (as some match ups might repeat with seeds flipping)
distinct(Conference_Final_Matchup) %>%
arrange(Conference_Final_Matchup)
return(conference_finals_combinations$Conference_Final_Matchup)
}
# Function run n number of sims using the model and playoff seeding
playoff_sim <- function(sims, xgb_last, playoff_team_seed){
results_list <- list()
final_series_list <- list()
for (sim_no in 1:sims) { # number of sims to run
# Get most recent features
# Ratings
most_recent_ratings <- hist_ratings %>%
left_join(
game_level %>% select(nbagameid,season,gametype),
by = c("season","nbagameid")
) %>%
filter(season == 2023 & gametype == 2) %>%
group_by(team) %>%
top_n(n = 1, wt = nbagameid) %>%
ungroup() %>%
select(-season, -nbagameid, -rating_period, -nbagameid_prev, -gametype)
# Rolling features
most_recent_rolling_features <- rolling_mean_features %>%
left_join(
game_level %>% select(nbagameid,season,gametype),
by = c("season","nbagameid")
) %>%
filter(season == 2023 & gametype == 2) %>%
group_by(team) %>%
top_n(n = 1, wt = nbagameid) %>%
ungroup() %>%
select(-season, -nbagameid, -gametype, -is_home)
# Player features
most_recent_player_features <- player_features %>%
left_join(
game_level %>% select(nbagameid,season,gametype),
by = c("season","nbagameid")
) %>%
filter(season == 2023 & gametype == 2) %>%
group_by(team) %>%
top_n(n = 1, wt = nbagameid) %>%
ungroup() %>%
select(-season, -nbagameid, -gametype)
# Match-ups for 2024 playoff brackets
# Playoff seeds by team
playoff_team_seeding <- data.frame(
team_name = c("BOS", "MIA", "CLE", "ORL", "MIL", "IND", "NYK", "PHI",
"OKC", "NOP", "LAC", "DAL", "MIN", "PHX", "DEN", "LAL"),
seed = c(1, 8, 4, 5, 3, 6, 2, 7,
1, 8, 4, 5, 3, 6, 2, 7),
stringsAsFactors = FALSE
)
# Round 1 initial match ups
initial_matchups <- list(
"East" = list("BOS" = "MIA", "CLE" = "ORL", "MIL" = "IND", "NYK" = "PHI"),
"West" = list("OKC" = "NOP", "LAC" = "DAL", "MIN" = "PHX", "DEN" = "LAL")
)
# Create the playoff bracket
initial_playoff_bracket <- create_round_bracket(initial_matchups,1, playoff_team_seed)
# Join in latest features from last regular season games for each team
initial_bracket_features <- bracket_with_features(initial_playoff_bracket, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
# Add match up_id
mutate(
matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
teams <- c(..1, ..2)
seeds <- c(..3, ..4)
sorted_teams <- teams[order(seeds)]
paste(sorted_teams, collapse = "-")
})
) %>%
group_by(matchup_id) %>%
# Add seed_id
mutate(
seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
) %>%
select(
conference:a_seed,
matchup_id,
seed_id,
starts_with("diff_"),
h_cumulative_unique_lineups,
a_cumulative_unique_lineups
)
# Initial bracket tracker
initial_bracket_checker <- run_series(initial_bracket_features,xgb_last)
# Track winners and losers from R1
e_r1_1_to_8_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "East"]
e_r1_1_to_8_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "East"]
e_r1_4_to_5_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "East"]
e_r1_4_to_5_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "East"]
e_r1_3_to_6_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "East"]
e_r1_3_to_6_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "East"]
e_r1_2_to_7_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "East"]
e_r1_2_to_7_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "East"]
w_r1_1_to_8_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "West"]
w_r1_1_to_8_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "West"]
w_r1_4_to_5_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "West"]
w_r1_4_to_5_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "West"]
w_r1_3_to_6_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "West"]
w_r1_3_to_6_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "West"]
w_r1_2_to_7_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "West"]
w_r1_2_to_7_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "West"]
# print("R1 successful")
# Get R2 match ups
r2_matchups <- list(
"East" = setNames(list(e_r1_4_to_5_winner, e_r1_3_to_6_winner),
c(e_r1_1_to_8_winner, e_r1_2_to_7_winner)),
"West" = setNames(list(w_r1_4_to_5_winner, w_r1_3_to_6_winner),
c(w_r1_1_to_8_winner, w_r1_2_to_7_winner))
)
# Get R2 bracket
r2_playoff_bracket <- create_round_bracket(r2_matchups, 2, playoff_team_seed) %>%
# Add matchup_id for alignment purposes
mutate(
matchup_id = pmap_chr(list(h_team, a_team), ~paste(sort(c(...)), collapse = "-"))
)
# Align bracket before adding features
r2_playoff_bracket_aligned <- align_bracket_seeding(r2_playoff_bracket)
# Add features to R2 bracket
r2_playoff_bracket_features <- bracket_with_features(r2_playoff_bracket_aligned, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
# Add matchup_id
mutate(
matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
teams <- c(..1, ..2)
seeds <- c(..3, ..4)
sorted_teams <- teams[order(seeds)]
paste(sorted_teams, collapse = "-")
})
) %>%
group_by(matchup_id) %>%
# Add seed_id
mutate(
seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
) %>%
select(
conference:a_seed,
matchup_id,
seed_id,
starts_with("diff_"),
h_cumulative_unique_lineups,
a_cumulative_unique_lineups
)
# R2 results
r2_bracket_checker <- run_series(r2_playoff_bracket_features,xgb_last)
# Track winners and losers from R2
e_r2_u_winner <- r2_bracket_checker$winner[
(r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" |
r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
r2_bracket_checker$conference == "East"]
e_r2_u_loser <- r2_bracket_checker$loser[
(r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" |
r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
r2_bracket_checker$conference == "East"]
e_r2_l_winner <- r2_bracket_checker$winner[
(r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" |
r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
r2_bracket_checker$conference == "East"]
e_r2_l_loser <- r2_bracket_checker$loser[
(r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" |
r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
r2_bracket_checker$conference == "East"]
w_r2_u_winner <- r2_bracket_checker$winner[
(r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" |
r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
r2_bracket_checker$conference == "West"]
w_r2_u_loser <- r2_bracket_checker$loser[
(r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" |
r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
r2_bracket_checker$conference == "West"]
w_r2_l_winner <- r2_bracket_checker$winner[
(r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" |
r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
r2_bracket_checker$conference == "West"]
w_r2_l_loser <- r2_bracket_checker$loser[
(r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" |
r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
r2_bracket_checker$conference == "West"]
# print("R2 successful")
# Get Conference finals match ups
r3_matchups <- list(
"East" = setNames(list(e_r2_l_winner),
c(e_r2_u_winner)),
"West" = setNames(list(w_r2_l_winner),
c(w_r2_u_winner))
)
# Get Conference finals bracket
r3_playoff_bracket <- create_round_bracket(r3_matchups, 3, playoff_team_seed) %>%
# Add matchup_id for alignment purposes
mutate(
matchup_id = pmap_chr(list(h_team, a_team), ~paste(sort(c(...)), collapse = "-"))
)
# Align seeding for R2 bracket
r3_playoff_bracket_aligned <- align_bracket_seeding(r3_playoff_bracket)
# Add features to R2 bracket
r3_playoff_bracket_features <- bracket_with_features(r3_playoff_bracket_aligned, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
# Add matchup_id
mutate(
matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
teams <- c(..1, ..2)
seeds <- c(..3, ..4)
sorted_teams <- teams[order(seeds)]
paste(sorted_teams, collapse = "-")
})
) %>%
group_by(matchup_id) %>%
# Add seed_id
mutate(
seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
) %>%
select(
conference:a_seed,
matchup_id,
seed_id,
starts_with("diff_"),
h_cumulative_unique_lineups,
a_cumulative_unique_lineups
)
# Conference finals results
r3_bracket_checker <- run_series(r3_playoff_bracket_features,xgb_last)
# Set R4 match ups
cf_potential_matchups <- get_cf_potential_matchups()
# Track winners and losers for R4 bracket
e_r3_winner <- r3_bracket_checker %>%
filter(seed_id %in% cf_potential_matchups &
conference == "East") %>%
pull(winner)
e_r3_loser <- r3_bracket_checker %>%
filter(seed_id %in% cf_potential_matchups &
conference == "East") %>%
pull(loser)
w_r3_winner <- r3_bracket_checker %>%
filter(seed_id %in% cf_potential_matchups &
conference == "West") %>%
pull(winner)
w_r3_loser <- r3_bracket_checker %>%
filter(seed_id %in% cf_potential_matchups &
conference == "West") %>%
pull(loser)
# print("R3 successful")
# Get Finals match ups
r4_matchups <- list(
"Both" = setNames(list(e_r3_winner),
c(w_r3_winner))
)
# Get Finals bracket
r4_playoff_bracket <- create_round_bracket(r4_matchups, 4, playoff_team_seed) %>%
# Add matchup_id for alignment purposes
mutate(
matchup_id = pmap_chr(list(h_team, a_team), ~paste(sort(c(...)), collapse = "-"))
)
# Align seeding for Finals bracket
r4_playoff_bracket_aligned <- align_bracket_seeding(r4_playoff_bracket)
# Add features to Finals bracket
r4_playoff_bracket_features <- bracket_with_features(r4_playoff_bracket_aligned, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
# Add matchup_id
mutate(
matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
teams <- c(..1, ..2)
seeds <- c(..3, ..4)
sorted_teams <- teams[order(seeds)]
paste(sorted_teams, collapse = "-")
})
) %>%
group_by(matchup_id) %>%
# Add seed_id
mutate(
seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
) %>%
select(
conference:a_seed,
matchup_id,
seed_id,
starts_with("diff_"),
h_cumulative_unique_lineups,
a_cumulative_unique_lineups
)
# Finals results
r4_bracket_checker <- run_series(r4_playoff_bracket_features,xgb_last)
# Track winners and losers for Final
f_r4_winner <- r4_bracket_checker %>%
filter(conference == "Both") %>%
pull(winner)
f_r4_loser <- r4_bracket_checker %>%
filter(conference == "Both") %>%
pull(loser)
# print("R4 successful")
# Add all series results to res list
results_list[[sim_no]] <- bind_rows(initial_bracket_checker, r2_bracket_checker, r3_bracket_checker, r4_bracket_checker)
results_list[[sim_no]]$sim_num <- sim_no
final_series_list[[sim_no]] <- data.frame(
simulation_id = rep(sim_no, 16),
conference = c(
"East","East","East","East","West","West","West","West",
"East","East","West","West",
"East" , "West",
"Both",
"Both"
),
round_made = c(
1,1,1,1,1,1,1,1,
2,2,2,2,
3,3,
4,
5
),
team_name = c(
e_r1_1_to_8_winner,e_r1_4_to_5_winner,e_r1_3_to_6_winner,e_r1_2_to_7_winner,w_r1_1_to_8_winner,w_r1_4_to_5_winner,w_r1_3_to_6_winner,w_r1_2_to_7_winner,
e_r2_u_winner,e_r2_l_winner,w_r2_u_winner,w_r2_l_winner,
e_r3_winner,w_r3_winner,
f_r4_winner,
f_r4_winner
)
)
# print(paste("Finished Sim:",sim_no))
}
results <- bind_rows(results_list)
final_series <- bind_rows(final_series_list)
resultdf <- list('results' = results, 'final_series' = final_series)
print("All sims complete!")
return(resultdf)
}
# Function to get probability of winning between 2 teams in a given round
get_series_prediction_2024 <- function(round, team1, team2, type = "Point Estimate", playoff_team_seed=playoff_team_seeding, sim_results=all_results) {
# Set league seeds for both inputted teams
team1_seed <- playoff_team_seed %>% filter(team_name == team1) %>% pull(league_seed)
team2_seed <-playoff_team_seed %>% filter(team_name == team2) %>% pull(league_seed)
# Get match up in format as sim results
if (team1_seed < team2_seed) {
matchup = paste0(team1,'-',team2)
} else {
matchup = paste0(team2,'-',team1)
}
# Adjust round text output for Finals series
if (round == "Finals") {
round_txt = "the Finals"
} else {
round_txt = round
}
if (type == "Point Estimate") {
title <- "Series Win- Point Estimate"
# Get probability of each team winning overall across rounds
results_summary <- sim_results %>%
filter(
matchup_id == matchup
& round_name == round
) %>%
group_by(matchup_id, round_number, round_name,winner) %>%
summarise(
win_count = n(), # Count the number of times each team has won
avg_total_games = mean(total_games), # Average number of total games played
.groups = 'drop'
) %>%
group_by(matchup_id, round_number) %>%
mutate(
win_pct = win_count / sum(win_count) # Calculate win percentage
) %>%
ungroup() %>%
# Join the logo URLs with the main data frame
left_join(logo_mapping, by = c("winner" = "team_name")) %>%
# Drop unnecessary cols
select(logo_url, win_pct, avg_total_games)
# Create the table using gt
res_table <- gt(results_summary) %>%
tab_header(
title = title,
subtitle = paste("Simulated",round,"series games between", team1, "and", team2)
) %>%
cols_label(
logo_url = "Team",
win_pct = "Win %",
avg_total_games = "Avg # of Games",
) %>%
fmt_percent(
columns = c(win_pct),
decimals = 1
) %>%
fmt_number(
columns = c(avg_total_games),
decimals = 1
) %>%
tab_options(table.width = pct(40)) %>%
gt_img_rows(logo_url) %>%
tab_source_note("The % chance that a team wins a series in a given, average number of games.") %>%
gt_theme_538() %>%
tab_style(
style = cell_text(align = 'center'),
locations = cells_column_labels(columns = everything())
) %>%
tab_style(
style = cell_text(align = 'center'),
locations = cells_body(columns = everything())
)
} else {
# Get probability of each team winning by n games
title <- "Series Win- Probabilistic"
results_summary <- sim_results %>%
filter(
matchup_id == matchup
& round_name == round
) %>%
group_by(matchup_id, round_number, round_name,total_games,winner) %>%
summarise(
win_count = n(), # Count the number of times each team has won
.groups = 'drop'
) %>%
group_by(matchup_id, round_number,total_games) %>%
mutate(
win_pct = win_count / sum(win_count) # Calculate win percentage
) %>%
ungroup() %>%
# Join the logo URLs with the main data frame
left_join(logo_mapping, by = c("winner" = "team_name")) %>%
# Drop unnecessary cols
select(logo_url, win_pct,total_games)
# Create the table using gt
res_table <- gt(results_summary) %>%
tab_header(
title = title,
subtitle = paste("Simulated",round,"series games between", team1, "and", team2)
) %>%
cols_label(
logo_url = "Team",
win_pct = "Win %",
total_games = "# of Games",
) %>%
fmt_percent(
columns = c(win_pct),
decimals = 1
) %>%
tab_options(table.width = pct(40)) %>%
gt_img_rows(logo_url) %>%
tab_source_note("The % chance that a team wins a series when playing a given total number of games.") %>%
gt_theme_538() %>%
tab_style(
style = cell_text(align = 'center'),
locations = cells_column_labels(columns = everything())
) %>%
tab_style(
style = cell_text(align = 'center'),
locations = cells_body(columns = everything())
)
}
return(res_table)
}
# Static variables ---------------------------
# Logo URLs mapped to team names
logo_mapping <- data.frame(
team_name = c("BOS", "MIA", "CLE", "ORL", "MIL", "IND", "NYK", "PHI", "OKC", "NOP", "LAC", "DAL", "MIN", "PHX", "DEN", "LAL"),
logo_url = c(
"https://content.sportslogos.net/logos/6/213/thumbs/slhg02hbef3j1ov4lsnwyol5o.gif",
"https://content.sportslogos.net/logos/6/214/thumbs/burm5gh2wvjti3xhei5h16k8e.gif",
"https://content.sportslogos.net/logos/6/222/thumbs/22253692023.gif",
"https://content.sportslogos.net/logos/6/217/thumbs/wd9ic7qafgfb0yxs7tem7n5g4.gif",
"https://content.sportslogos.net/logos/6/225/thumbs/22582752016.gif",
"https://content.sportslogos.net/logos/6/224/thumbs/22448122018.gif",
"https://content.sportslogos.net/logos/6/216/thumbs/21671702024.gif",
"https://content.sportslogos.net/logos/6/218/thumbs/21870342016.gif",
"https://content.sportslogos.net/logos/6/2687/thumbs/khmovcnezy06c3nm05ccn0oj2.gif",
"https://content.sportslogos.net/logos/6/4962/thumbs/496292922024.gif",
"https://content.sportslogos.net/logos/6/236/thumbs/23655422025.gif",
"https://content.sportslogos.net/logos/6/228/thumbs/22834632018.gif",
"https://content.sportslogos.net/logos/6/232/thumbs/23296692018.gif",
"https://content.sportslogos.net/logos/6/238/thumbs/23843702014.gif",
"https://content.sportslogos.net/logos/6/229/thumbs/22989262019.gif",
"https://content.sportslogos.net/logos/6/237/thumbs/23773242024.gif"
)
)
# Seeding for playoffs
playoff_team_seeding <- data.frame(
team_name = c("BOS", "MIA", "CLE", "ORL", "MIL", "IND", "NYK", "PHI",
"OKC", "NOP", "LAC", "DAL", "MIN", "PHX", "DEN", "LAL"),
seed = c(
1, 8, 4, 5, 3, 6, 2, 7,
1, 8, 4, 5, 3, 6, 2, 7
),
league_seed = c(
1, 16, 11, 12, 8, 15, 7, 14,
2, 9, 5, 6, 4, 10, 3, 13
),
stringsAsFactors = FALSE
)
# Initial setup ---------------------------
# Get game level data
game_level <- team_data %>%
filter(season >= 2014 & off_home == 1) %>%
arrange(season, gamedate, nbagameid) %>%
mutate(gamedate = as.Date(gamedate)) %>%
select(season:gamedate,off_team,off_win,fg2made:shotattemptpoints) %>%
rename_with(~ paste0("h_", .), fg2made:shotattemptpoints) %>%
rename("h_team" = off_team, "is_win" = off_win) %>%
inner_join(team_data %>%
filter(season >= 2014 & off_home == 0) %>%
arrange(season, gamedate, nbagameid) %>%
mutate(gamedate = as.Date(gamedate)) %>%
select(season,nbagameid,off_team,fg2made:shotattemptpoints) %>%
rename_with(~ paste0("a_", .), fg2made:shotattemptpoints) %>%
rename("a_team" = off_team),
by = c("season","nbagameid")
) %>%
select(season:h_team,a_team,is_win,h_fg2made:h_shotattemptpoints,a_fg2made:a_shotattemptpoints)
# Team Features ---------------------------
# Advanced Box score Metrics
game_level <- game_level %>%
# Offensive advanced team stats
mutate(
h_ppa = h_shotattemptpoints/h_shotattempts, # Points per attempt
h_ppp = h_shotattemptpoints/h_possessions, # Points per possession
h_tov_pct = h_turnovers/(h_shotattempts + h_turnovers), # Turnover %
h_blk_pct = a_blocksagainst/a_fg2attempted, # Block %
h_ortg = h_points/(h_possessions/100), # Offensive Rating
h_drtg = a_points/(a_possessions/100), # Defensive Rating
h_ntrg = h_ortg - h_drtg, # Net Rating
h_efg_pct = (h_fgmade + (0.5 * h_fg3made)) / (h_fgattempted * 100), # Effective Field Goal %
h_ts_pct = h_points / (2 * (h_fgattempted + .475 * h_ftattempted)), # True Shooting %
h_ft_rate = h_ftmade / h_fgattempted, # Free Throw Rate
) %>%
# Defensive advanced team stats
mutate(
a_ppa = a_shotattemptpoints/a_shotattempts, # Points per attempt
a_ppp = a_shotattemptpoints/a_possessions, # Points per possession
a_tov_pct = a_turnovers/(a_shotattempts + a_turnovers), # Turnover %
a_blk_pct = h_blocksagainst/h_fg2attempted, # Block %
a_ortg = a_points/(a_possessions/100), # Offensive Rating
a_drtg = h_points/(h_possessions/100), # Defensive Rating
a_ntrg = a_ortg - a_drtg, # Net Rating
a_efg_pct = (a_fgmade + (0.5 * a_fg3made)) / (a_fgattempted * 100), # Effective Field Goal %
a_ts_pct = a_points / (2 * (a_fgattempted + .475 * a_ftattempted)), # True Shooting %
a_ft_rate = a_ftmade / a_fgattempted, # Free Throw Rate
)
# Rolling Averages
# Need to convert back to origin 2 row per match df structure
team_level <- game_level %>%
mutate(h_is_home = 1) %>%
select(season,nbagameid, gamedate, h_team, h_is_home, h_fg2made:h_shotattemptpoints, h_ppa:h_ft_rate) %>%
rename_with(~ str_remove_all(., "h_"), h_team:h_ft_rate) %>%
bind_rows (
game_level %>%
mutate(a_is_home = 0) %>%
select(season,nbagameid, gamedate, a_team, a_is_home, a_fg2made:a_shotattemptpoints, a_ppa:a_ft_rate) %>%
rename_with(~ str_remove_all(., "a_"), a_team:a_ft_rate)
) %>%
arrange(season, nbagameid)
# Get rolling avg for box score and advanced stats
rolling_mean_features <- team_level %>%
mutate_at(
vars(fg2made:ft_rate), # Columns for which we want a rolling mean
.funs = ~ roll_mean(., 5, align = "right", fill = NA) # Rolling mean for last 5 games
) %>%
ungroup() %>%
select(season, nbagameid, team, is_home, fg2made:ft_rate) %>%
filter(!is.na(fg2made))
# Time/Date Features
# Calculate days since last game and days until next game
days_since_stats <- team_level %>%
select(season, nbagameid, gamedate, team) %>%
arrange(season, team, gamedate) %>% # Arrange by season, team, and date
group_by(season, team) %>% # Group by season and team
mutate(
days_since_last_game = c(0, diff(gamedate)), # Calculate days since last game
days_until_next_game = as.integer(lead(gamedate) - gamedate) # Calculate days until next game
) %>%
mutate(
# Reset the days since last game for the first game of each season
days_since_last_game = if_else(row_number() == 1, NA, days_since_last_game),
days_until_next_game = if_else(row_number() == n(), NA, as.integer(days_until_next_game))
) %>%
ungroup()
# Player Features ---------------------------
# Get player level game data
player_level <- player_data %>%
filter(season >= 2014) %>%
mutate(gamedate = as.Date(gamedate)) %>%
arrange(season, gamedate, nbagameid, nbateamid)
# Get number of players injured per team per match
player_features <- player_level %>%
# Calculate various player-specific percentages and metrics
mutate(
oreb_pct = reboffensive / offensivereboundchances, # Offensive rebound percentage
dreb_pct = rebdefensive / defensivereboundchances, # Defensive rebound percentage
tov_pct = turnovers / (fgattempted + turnovers), # Turnover percentage
stl_pct = replace(steals / defensivepossessions, is.infinite(steals / defensivepossessions), NA), # Steal percentage, handling infinite values
blk_pct = replace(blocks / opponentteamfg2attempted, is.infinite(blocks / opponentteamfg2attempted), NA), # Block percentage, handling infinite values
usg_pct = (shotattempts + turnovers) / (teamshotattempts + teamturnovers), # Usage percentage
ast_pct = assists / (teamfgmade - (fg3made + fg2made)), # Assist percentage
pnt3_pct = fg3made / fg3attempted, # 3-point success percentage
pnt2_pct = fg2made / fg2attempted, # 2-point success percentage
h_ast_pct = assists / (fgattempted + (0.475 * (ftattempted + assists + turnovers))), # Hybrid assist percentage
game_score_metric = points + (0.4 * fgmade) - (0.7 * fgattempted) - (0.4 * (ftattempted - ftmade)) + (0.7 * reboffensive) + (0.3 * rebdefensive) + steals + (0.7 * assists) + (0.7 * blocks) - (0.4 * ((defensivefouls + offensivefouls) - turnovers)) # Game score metric calculation
) %>%
# Group by season, game, and team for summary statistics
group_by(season, nbagameid, team) %>%
summarise(
mean_oreb_pct = mean(oreb_pct, na.rm = TRUE),
mean_dreb_pct = mean(dreb_pct, na.rm = TRUE),
mean_tov_pct = mean(tov_pct, na.rm = TRUE),
mean_stl_pct = mean(stl_pct, na.rm = TRUE),
mean_blk_pct = mean(blk_pct, na.rm = TRUE),
mean_usg_pct = mean(usg_pct, na.rm = TRUE),
mean_ast_pct = mean(ast_pct, na.rm = TRUE),
max_usg_pct = max(usg_pct, na.rm = TRUE), # Max usage % as a proxy for teams reliant on star players for success
inj_players = sum(missed), # Total injured players
avg_mp_starter = mean(seconds[starter == 1], na.rm = TRUE) / 60, # Average minutes played by starters
avg_mp_bench = mean(seconds[starter == 0], na.rm = TRUE) / 60, # Average minutes played by bench players
pnts_by_starters = sum(points[starter == 1], na.rm = TRUE), # Points by starters
pnts_by_bench = sum(points[starter == 0], na.rm = TRUE), # Points by bench
sharp_shooters = sum(pnt3_pct > 0.35, na.rm = TRUE), # Count of sharp shooters
paint_specialists = sum(pnt2_pct > 0.50, na.rm = TRUE), # Count of paint specialists
game_score_metric = mean(game_score_metric, na.rm = TRUE),
.groups = 'drop'
) %>%
# Join with data to track unique lineups over time
inner_join(
player_level %>%
filter(starter == 1) %>%
arrange(season, team, nbagameid, nbapersonid) %>%
group_by(season, team, nbagameid) %>%
summarise(lineup = paste(nbapersonid, collapse = "-"), .groups = 'drop') %>%
ungroup() %>%
group_by(season, team) %>%
arrange(season, team, nbagameid) %>%
# Track unique lineups by cumulative count of first occurrences
mutate(
cumulative_unique_lineups = cumsum(!duplicated(lineup))
) %>%
ungroup() %>%
select(-lineup),
by = c("season", "nbagameid", "team")
)
# Team Strength Features ---------------------------
# Build and run the glicko-2. rating system with set parameters
glicko2_model <- glicko2(
game_level %>% arrange(season,nbagameid) %>% mutate(nbagameid = row_number()) %>% select(nbagameid,h_team,a_team,is_win),
status = NULL,
init = c(2200,250,0.03),
tau = 1.2,
history = TRUE
)
# Get historical ratings for each game in training data
hist_ratings <- glicko2_model[2] %>%
# Convert the matrix to a data frame
as.data.frame() %>%
# Add row names as a column for team names
rownames_to_column(var = "team") %>%
# Pivot data longer to transform the data from wide to long format
pivot_longer(
cols = -team,
names_to = "rating_period",
values_to = "rating"
) %>%
# Filter columns that end with '.Lag'
filter(endsWith(rating_period, ".Lag")) %>%
# Extract numbers from 'rating_period' strings
mutate(rating_period = str_extract(rating_period, "\\d+")) %>%
# Temporarily rename the 'rating' column for lag identification
rename(is_lag = rating) %>%
# Join with the main ratings from the Glicko2 model
left_join(
glicko2_model[2] %>%
as.data.frame() %>%
# Add row names as a column for team names
rownames_to_column(var = "team") %>%
# Pivot data longer to transform the data from wide to long format
pivot_longer(
cols = -team,
names_to = "rating_period",
values_to = "rating"
) %>%
# Filter columns that end with '.Rating'
filter(endsWith(rating_period, ".Rating")) %>%
# Extract numbers from 'rating_period' strings
mutate(rating_period = str_extract(rating_period, "\\d+")),
by = c("team", "rating_period")
) %>%
# Filter for entries where 'is_lag' is zero and 'rating' is not the initial value (2200)
filter(is_lag == 0 & rating != 2200) %>%
# Convert 'rating_period' to numeric for sorting
mutate(rating_period = as.numeric(rating_period)) %>%
# Remove the 'is_lag' column
select(-is_lag) %>%
# Arrange by 'rating_period' to ensure chronological order
arrange(rating_period) %>%
# Map game and season IDs from another data set
mutate(
nbagameid = team_level$nbagameid,
season = team_level$season
) %>%
# Group by 'season' and 'team' to handle game-level data
group_by(season, team) %>%
arrange(nbagameid) %>%
# Create a lagged 'nbagameid' to link ratings to specific games
mutate(nbagameid_prev = lag(nbagameid)) %>%
ungroup() %>%
# Filter out any missing values in 'nbagameid_prev'
filter(!is.na(nbagameid_prev))
# Combine features ---------------------------
# Last game look-up helper
last_game_lookup <- team_level %>%
rename(team = team) %>%
group_by(season, team) %>%
arrange(nbagameid) %>%
mutate(nbagameid_prev = lag(nbagameid)) %>%
select(season, team, nbagameid, nbagameid_prev) %>%
filter(!is.na(nbagameid_prev))
# Preparing the features data frame by joining game data with historical team ratings and player statistics
features <- game_level %>%
select(season:is_win) %>%
arrange(season,nbagameid) %>%
# Join in last game look-up df
inner_join(
last_game_lookup,
by = c(
"nbagameid" = "nbagameid",
"h_team" = "team",
"season" = "season"
)
) %>%
rename("h_nbagameid_prev" = nbagameid_prev) %>%
inner_join(
last_game_lookup,
by = c(
"nbagameid" = "nbagameid",
"a_team" = "team",
"season" = "season"
)
) %>%
rename("a_nbagameid_prev" = nbagameid_prev) %>%
# Join in rating system feature
inner_join(
hist_ratings %>% arrange(rating_period),
by = c(
"season" = "season",
"h_nbagameid_prev" = "nbagameid",
"h_team" = "team"
)
) %>%
rename("h_rating" = rating) %>%
inner_join(
hist_ratings %>% arrange(rating_period),
by = c(
"season" = "season",
"a_nbagameid_prev" = "nbagameid",
"a_team" = "team"
)
) %>%
rename("a_rating" = rating) %>%
# Join in rolling mean features
inner_join(
rolling_mean_features %>% select(-is_home),
by = c(
"season" = "season",
"h_nbagameid_prev" = "nbagameid",
"h_team" = "team"
)
) %>%
rename_with(~ paste0("h_", .), fg2made:ft_rate) %>%
inner_join(
rolling_mean_features %>% select(-is_home),
by = c(
"season" = "season",
"a_nbagameid_prev" = "nbagameid",
"a_team" = "team"
)
) %>%
rename_with(~ paste0("a_", .), fg2made:ft_rate) %>%
# Join in player features
inner_join(
player_features,
by = c(
"season" = "season",
"h_nbagameid_prev" = "nbagameid",
"h_team" = "team"
)
) %>%
rename_with(~ paste0("h_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
inner_join(
player_features,
by = c(
"season" = "season",
"a_nbagameid_prev" = "nbagameid",
"a_team" = "team"
)
) %>%
rename_with(~ paste0("a_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
# Reduce number of features by finding difference between home and away teams
mutate(
diff_rating = h_rating - a_rating,
diff_fg2made = h_fg2made - a_fg2made,
diff_fg2missed = h_fg2missed - a_fg2missed,
diff_fg2attempted = h_fg2attempted - a_fg2attempted,
diff_fg3made = h_fg3made - a_fg3made,
diff_fg3missed = h_fg3missed - a_fg3missed,
diff_fg3attempted = h_fg3attempted - a_fg3attempted,
diff_fgmade = h_fgmade - a_fgmade,
diff_fgmissed = h_fgmissed - a_fgmissed,
diff_fgattempted = h_fgattempted - a_fgattempted,
diff_ftmade = h_ftmade - a_ftmade,
diff_ftmissed = h_ftmissed - a_ftmissed,
diff_ftattempted = h_ftattempted - a_ftattempted,
diff_reboffensive = h_reboffensive - a_reboffensive,
diff_rebdefensive = h_rebdefensive - a_rebdefensive,
diff_reboundchance = h_reboundchance - a_reboundchance,
diff_assists = h_assists - a_assists,
diff_stealsagainst = h_stealsagainst - a_stealsagainst,
diff_turnovers = h_turnovers - a_turnovers,
diff_blocksagainst = h_blocksagainst - a_blocksagainst,
diff_defensivefouls = h_defensivefouls - a_defensivefouls,
diff_offensivefouls = h_offensivefouls - a_offensivefouls,
diff_shootingfoulsdrawn = h_shootingfoulsdrawn - a_shootingfoulsdrawn,
diff_possessions = h_possessions - a_possessions,
diff_points = h_points - a_points,
diff_shotattempts = h_shotattempts - a_shotattempts,
diff_andones = h_andones - a_andones,
diff_shotattemptpoints = h_shotattemptpoints - a_shotattemptpoints,
diff_ppa = h_ppa - a_ppa,
diff_ppp = h_ppp - a_ppp,
diff_tov_pct = h_tov_pct - a_tov_pct,
diff_blk_pct = h_blk_pct - a_blk_pct,
diff_ortg = h_ortg - a_ortg,
diff_drtg = h_drtg - a_drtg,
diff_ntrg = h_ntrg - a_ntrg,
diff_efg_pct = h_efg_pct - a_efg_pct,
diff_ts_pct = h_ts_pct - a_ts_pct,
diff_ft_rate = h_ft_rate - a_ft_rate,
diff_mean_oreb_pct = h_mean_oreb_pct - a_mean_oreb_pct,
diff_mean_dreb_pct = h_mean_dreb_pct - a_mean_dreb_pct,
diff_mean_tov_pct = h_mean_tov_pct - a_mean_tov_pct,
diff_mean_stl_pct = h_mean_stl_pct - a_mean_stl_pct,
diff_mean_blk_pct = h_mean_blk_pct - a_mean_blk_pct,
diff_mean_usg_pct = h_mean_usg_pct - a_mean_usg_pct,
diff_mean_ast_pct = h_mean_ast_pct - a_mean_ast_pct,
diff_max_usg_pct = h_max_usg_pct - a_max_usg_pct,
diff_avg_mp_starter = h_avg_mp_starter - a_avg_mp_starter,
diff_avg_mp_bench = h_avg_mp_bench - a_avg_mp_bench,
diff_pnts_by_starters = h_pnts_by_starters - a_pnts_by_starters,
diff_pnts_by_bench = h_pnts_by_bench - a_pnts_by_bench,
diff_sharp_shooters = h_sharp_shooters - a_sharp_shooters,
diff_paint_specialists = h_paint_specialists - a_paint_specialists,
diff_game_score_metric = h_game_score_metric - a_game_score_metric,
) %>%
# Select relevant columns
select(
season,
nbagameid,
gamedate,
h_team,
a_team,
is_win,
starts_with("diff_"),
h_cumulative_unique_lineups,
a_cumulative_unique_lineups
) %>%
mutate(
is_win = as.factor(is_win)
)
# Feature EDA ---------------------------
# Box score metrics P1
features %>%
mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
pivot_longer(diff_fg2made:diff_reboundchance, names_to = "stat", values_to = "value") %>%
ggplot(aes(is_win, value, color = is_win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 4, ncol = 4) +
labs(y = NULL, color = NULL, fill = NULL)
There are no distinct differences in game outcome across these metrics. Intuition tells us that these metrics will not be good discriminators in our model.
# Box score metrics P2
features %>%
mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
pivot_longer(diff_assists:diff_shotattemptpoints, names_to = "stat", values_to = "value") %>%
ggplot(aes(is_win, value, color = is_win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 3, ncol = 6) +
labs(y = NULL, color = NULL, fill = NULL)
diff_blocksagainst is the only metric with a clear
visual difference between game outcomes. Again, the difference box score
metrics appear to mot be good discriminators of wins and losses.
# Advanced team and player metrics
features %>%
mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
pivot_longer(diff_ppa:diff_max_usg_pct, names_to = "stat", values_to = "value") %>%
ggplot(aes(is_win, value, color = is_win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 3, ncol = 6) +
labs(y = NULL, color = NULL, fill = NULL)
There is no clear differences across most of these metrics except for
diff_max_usg_pct and diff_ntrg. It appears
that these advanced metrics overall will not be good discriminators of
wins and loses in our model.
# Rating feature
features %>%
mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
pivot_longer(diff_rating, names_to = "stat", values_to = "value") %>%
ggplot(aes(is_win, value, color = is_win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 1) +
labs(y = NULL, color = NULL, fill = NULL)
We can see the biggest difference in game outcome in the
diff_rating metric. This appears to be our strongest
discriminator of wins and loses thus far.
# Cumulative lineup features
features %>%
mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
pivot_longer(h_cumulative_unique_lineups:a_cumulative_unique_lineups, names_to = "stat", values_to = "value") %>%
ggplot(aes(is_win, value, color = is_win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 2) +
labs(y = NULL, color = NULL, fill = NULL)
We can see clear differences in game outcome between both
a_cumulative_unique_lineups and
h_cumulative_unique_lineups. This is an early indication
that the number of unique starting lineups cumulative across seasons is
a good discriminator of wins and loses for our model, but
diff_rating is still on top.
# Misc Features
features %>%
mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
pivot_longer(diff_avg_mp_bench:diff_game_score_metric, names_to = "stat", values_to = "value") %>%
ggplot(aes(is_win, value, color = is_win)) +
geom_boxplot(alpha = 0.4) +
facet_wrap(~stat, scales = "free_y", nrow = 3) +
labs(y = NULL, color = NULL, fill = NULL)
We can see clear differences in game outcome between
diff_game_score_metric and
diff_paint_specialists but less so in
diff_pnts_by_starters and
diff_pnts_by_bench.
# Model Preparation ---------------------------
# Create Splits (80-20)
splits <- initial_split(
features,
prop = 0.8
)
# Create pre-processing recipe
preprocessing_recipe <-
recipe(is_win ~ ., data = splits %>% training()) %>%
# Removes unnecessary columns
step_rm(season, nbagameid, gamedate, h_team, a_team) %>%
# Removes observations (rows of data) if they contain NA or NaN values
step_naomit(everything(), skip= TRUE) %>%
# Removes any numeric variables that have zero variance
step_zv(all_numeric(), -all_outcomes()) %>%
# Remove highly correlated variables
step_corr(all_numeric(), threshold = 0.8, method = "spearman")
# Observe the recipe on features
features_proprocessed <- preprocessing_recipe %>%
prep() %>%
bake(splits %>% training())
# Set Seed for reproducibility
set.seed(123)
feature_folds <- vfold_cv(training(splits), strata = is_win, v = 5)
# Create XGB boost classification model spec
xgb_spec <- boost_tree(
mode = "classification",
trees = 500,
tree_depth = tune(), min_n = tune(),
loss_reduction = tune(), # first three: model complexity
sample_size = tune(), mtry = tune(), # randomness
learn_rate = tune() # step size
) %>%
set_engine("xgboost")
# Display model specification
xgb_spec
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 500
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Computational engine: xgboost
# Create model workflow
xgb_wf <- workflow() %>%
add_recipe(preprocessing_recipe) %>%
add_model(xgb_spec)
# Hyper-parameter tuning ---------------------------
# Use anova race to tune the grid and save time on poor performing parameter combinations
doParallel::registerDoParallel()
set.seed(345)
xgb_res <- tune_race_anova(
xgb_wf,
resamples = feature_folds,
grid = 30,
metrics = metric_set(roc_auc),
control = control_race(verbose_elim = TRUE,save_pred=TRUE)
)
## ℹ Evaluating against the initial 3 burn-in resamples.
## i Creating pre-processing data to finalize unknown parameter: mtry
##
## ℹ Racing will maximize the roc_auc metric.
## ℹ Resamples are analyzed in a random order.
## ℹ Fold3: 14 eliminated; 16 candidates remain.
##
## ℹ Fold5: 9 eliminated; 7 candidates remain.
# Plot the parameter combination race
plot_race(xgb_res)
Using tune_race_anova we can eliminated combinations of
parameters that are low performing and only use compute on parameter
combinations that are high performing. We can see that only 3 parameter
combinations made it to the 5th and final race stage.
# Collect metrics for the model training
collect_metrics(xgb_res)
## # A tibble: 7 × 12
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 26 10 3 0.00595 5.38e- 2 0.736 roc_auc
## 2 19 20 9 0.00162 1.23e-10 0.485 roc_auc
## 3 22 6 2 0.00232 3.14e- 1 0.394 roc_auc
## 4 18 24 5 0.00266 1.56e+ 1 0.833 roc_auc
## 5 39 13 2 0.0197 6.49e+ 0 0.578 roc_auc
## 6 12 15 3 0.00494 4.87e- 3 0.687 roc_auc
## 7 33 27 3 0.00108 1.91e- 7 0.441 roc_auc
## # ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
## # .config <chr>
# Show best combination of parameters
show_best(xgb_res, metric = "roc_auc")
## # A tibble: 5 × 12
## mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
## <int> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 33 27 3 0.00108 0.000000191 0.441 roc_auc
## 2 18 24 5 0.00266 15.6 0.833 roc_auc
## 3 22 6 2 0.00232 0.314 0.394 roc_auc
## 4 26 10 3 0.00595 0.0538 0.736 roc_auc
## 5 12 15 3 0.00494 0.00487 0.687 roc_auc
## # ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
## # .config <chr>
# Refit best model on training data and assess performance on test set
xgb_last <-
xgb_wf %>%
finalize_workflow(select_best(xgb_res,metric = "roc_auc")) %>%
last_fit(splits)
# Show metrics
collect_metrics(xgb_last)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.649 Preprocessor1_Model1
## 2 roc_auc binary 0.691 Preprocessor1_Model1
## 3 brier_class binary 0.231 Preprocessor1_Model1
# Capture training predictions
xgb_last_pred <- collect_predictions(xgb_last)
# Display output
xgb_last_pred
## # A tibble: 2,481 × 7
## .pred_class .pred_0 .pred_1 id .row is_win .config
## <fct> <dbl> <dbl> <chr> <int> <fct> <chr>
## 1 1 0.402 0.598 train/test split 17 1 Preprocessor1_Mode…
## 2 0 0.571 0.429 train/test split 22 1 Preprocessor1_Mode…
## 3 1 0.465 0.535 train/test split 23 0 Preprocessor1_Mode…
## 4 0 0.537 0.463 train/test split 37 1 Preprocessor1_Mode…
## 5 0 0.593 0.407 train/test split 38 0 Preprocessor1_Mode…
## 6 1 0.469 0.531 train/test split 42 1 Preprocessor1_Mode…
## 7 0 0.593 0.407 train/test split 43 0 Preprocessor1_Mode…
## 8 0 0.521 0.479 train/test split 44 0 Preprocessor1_Mode…
## 9 1 0.462 0.538 train/test split 54 0 Preprocessor1_Mode…
## 10 1 0.459 0.541 train/test split 56 0 Preprocessor1_Mode…
## # ℹ 2,471 more rows
# Extract variable importance and plot
xgb_fit <- extract_fit_parsnip(xgb_last)
vip(xgb_fit, num_features = 15)
Plotting variable importance allows us to quantify how much a given
feature in our model is explaining game outcome. In this case our
diff_rating variable is explaining our predictor the most,
however also Hollinger’s game score metric and the maximum usage %
player percentage on the team were also important variables. The
cumulative number of lineups for the home and difference in points by
starters between teams were less important but appear in the top 5 for
variable importance in our model.
# Evaluate ROC curve
xgb_last %>%
collect_predictions() %>%
roc_curve(is_win, .pred_1) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(linewidth = 1.5, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
linewidth = 1.2
)
ROC curves allow us to assess the classification performance of our
model i.e. how well we predict game outcomes (as wins). It uses a
graphical representation of two variables, sensitivity
(true positive rate) and 1- specificity (false positive
rate). We can see that our model was able to correctly identify when the
predicted team will win and not too often incorrectly predicting a loss
when the outcome was a win.
# 2024 Playoff Simulation ---------------------------
# Set playoff seeding by conference and by league
playoff_team_seeding
## team_name seed league_seed
## 1 BOS 1 1
## 2 MIA 8 16
## 3 CLE 4 11
## 4 ORL 5 12
## 5 MIL 3 8
## 6 IND 6 15
## 7 NYK 2 7
## 8 PHI 7 14
## 9 OKC 1 2
## 10 NOP 8 9
## 11 LAC 4 5
## 12 DAL 5 6
## 13 MIN 3 4
## 14 PHX 6 10
## 15 DEN 2 3
## 16 LAL 7 13
# Set number of simulations
nr_sims <- 1000
# Run the sims and get time elapsed
system.time(
sim_results <- playoff_sim(nr_sims, xgb_last, playoff_team_seeding)
)
## [1] "All sims complete!"
## user system elapsed
## 517.859 2.153 531.439
# Set sim outputs to variables
all_results <- sim_results$results
all_final_series <- sim_results$final_series
# Show the number of simulations that resulted in each team being eliminated at a given stage
results_extended <-
all_final_series %>%
group_by(round_made, team_name) %>%
summarise(
total = n(),
.groups = 'drop'
) %>%
pivot_wider(
names_from = round_made,
values_from = c(total),
values_fill = 0
)
# Display table
results_extended
## # A tibble: 16 × 6
## team_name `1` `2` `3` `4` `5`
## <chr> <int> <int> <int> <int> <int>
## 1 BOS 659 430 269 145 145
## 2 CLE 501 205 91 27 27
## 3 DAL 575 330 162 90 90
## 4 DEN 463 226 120 68 68
## 5 IND 597 300 156 77 77
## 6 LAC 425 222 99 56 56
## 7 LAL 537 274 161 99 99
## 8 MIA 341 171 65 23 23
## 9 MIL 403 160 63 27 27
## 10 MIN 536 282 129 68 68
## 11 NOP 382 142 66 31 31
## 12 NYK 512 282 139 64 64
## 13 OKC 618 306 159 95 95
## 14 ORL 499 194 90 28 28
## 15 PHI 488 258 127 48 48
## 16 PHX 464 218 104 54 54
# Visualise the playoff bracket simulations ---------------------------
# Processing and summarising results
results_proportion <- results_extended %>%
group_by(team_name) %>%
reframe(across(c(`1`, `2`, `3`, `4`, `5`),
~ .x / nr_sims,
.names = "Round {col}")) %>%
select(-`Round 5`) %>%
rename(
'Conference Finals' = `Round 3`,
'Finals' = `Round 4`,
) %>%
arrange(desc(Finals), desc(`Conference Finals`))
# Join the logo URLs with the main data frame
results_proportion <- results_proportion %>%
left_join(logo_mapping, by = "team_name")
# Create probabilities table for advancing rounds
results_proportion %>%
select(-team_name) %>%
select(logo_url,`Round 1`:Finals) %>%
gt() %>%
tab_header(
title = "2023 NBA Playoff Simulations",
subtitle = "The % chance that a team wins that round*",
) %>%
fmt_percent(
columns = c("Round 1", "Round 2", "Conference Finals", "Finals"),
decimals = 1
) %>%
cols_label(
"logo_url" = "Team",
"Round 1" = "Rnd 1",
"Round 2" = "Rnd 2",
"Conference Finals" = "Conf. Finals",
"Finals" = "Finals",
) %>%
tab_options(table.width = pct(50)) %>%
gt_img_rows(logo_url) %>%
tab_source_note("*The proportion of simulated playoff brackets where a team wins or advances on from that round.") %>%
gt_theme_538() %>%
tab_style(
style = cell_text(align = 'center'),
locations = cells_column_labels(columns = everything())
) %>%
tab_style(
style = cell_text(align = 'center'),
locations = cells_body(columns = everything())
)
| 2023 NBA Playoff Simulations | ||||
| The % chance that a team wins that round* | ||||
| Team | Rnd 1 | Rnd 2 | Conf. Finals | Finals |
|---|---|---|---|---|
| 65.9% | 43.0% | 26.9% | 14.5% | |
| 53.7% | 27.4% | 16.1% | 9.9% | |
| 61.8% | 30.6% | 15.9% | 9.5% | |
| 57.5% | 33.0% | 16.2% | 9.0% | |
| 59.7% | 30.0% | 15.6% | 7.7% | |
| 53.6% | 28.2% | 12.9% | 6.8% | |
| 46.3% | 22.6% | 12.0% | 6.8% | |
| 51.2% | 28.2% | 13.9% | 6.4% | |
| 42.5% | 22.2% | 9.9% | 5.6% | |
| 46.4% | 21.8% | 10.4% | 5.4% | |
| 48.8% | 25.8% | 12.7% | 4.8% | |
| 38.2% | 14.2% | 6.6% | 3.1% | |
| 49.9% | 19.4% | 9.0% | 2.8% | |
| 50.1% | 20.5% | 9.1% | 2.7% | |
| 40.3% | 16.0% | 6.3% | 2.7% | |
| 34.1% | 17.1% | 6.5% | 2.3% | |
| *The proportion of simulated playoff brackets where a team wins or advances on from that round. | ||||
# 2024 Playoff Series predictor ---------------------------
# Point estimate example
get_series_prediction_2024("Finals","DAL","BOS", "Point Estimate")
| Series Win- Point Estimate | ||
| Simulated Finals series games between DAL and BOS | ||
| Team | Win % | Avg # of Games |
|---|---|---|
| 51.3% | 5.7 | |
| 48.7% | 5.7 | |
| The % chance that a team wins a series in a given, average number of games. | ||
# 2024 Playoff Series predictor ---------------------------
# Probabilistic
get_series_prediction_2024("Finals","DAL","BOS", "Probabilistic")
| Series Win- Probabilistic | ||
| Simulated Finals series games between DAL and BOS | ||
| Team | Win % | # of Games |
|---|---|---|
| 66.7% | 4 | |
| 33.3% | 4 | |
| 61.5% | 5 | |
| 38.5% | 5 | |
| 26.7% | 6 | |
| 73.3% | 6 | |
| 75.0% | 7 | |
| 25.0% | 7 | |
| The % chance that a team wins a series when playing a given total number of games. | ||
Find two teams that had a competitive window of 2 or more consecutive seasons making the playoffs and that under performed your model’s expectations for them, losing series they were expected to win. Why do you think that happened? Classify one of them as bad luck and one of them as relating to a cause not currently accounted for in your model. If given more time and data, how would you use what you found to improve your model?
# Find teams that made 23 and 24 season playoffs and that under performed in Simulated 2024 Playoffs
p3_playoff_teams <- team_data %>%
filter((season >= 2022 | season <= 2023) & gametype == 4) %>%
distinct(off_team) %>%
rename("team_name" = "off_team")
glimpse(p3_playoff_teams)
## Rows: 30
## Columns: 1
## $ team_name <chr> "DAL", "MIA", "MEM", "OKC", "SAS", "LAL", "PHX", "CHI", "BOS…
# Prediction for Knicks vs Indiana in Round 2 2024 Playoffs
get_series_prediction_2024("Round 2","IND","NYK", "Probabilistic")
| Series Win- Probabilistic | ||
| Simulated Round 2 series games between IND and NYK | ||
| Team | Win % | # of Games |
|---|---|---|
| 57.6% | 4 | |
| 42.4% | 4 | |
| 52.1% | 5 | |
| 47.9% | 5 | |
| 50.5% | 6 | |
| 49.5% | 6 | |
| 40.4% | 7 | |
| 59.6% | 7 | |
| The % chance that a team wins a series when playing a given total number of games. | ||
# Prediction for Denver vs Lakers in Round 2 2024 Playoffs
get_series_prediction_2024("Round 1","DEN","LAL", "Point Estimate")
| Series Win- Point Estimate | ||
| Simulated Round 1 series games between DEN and LAL | ||
| Team | Win % | Avg # of Games |
|---|---|---|
| 46.3% | 5.8 | |
| 53.7% | 5.8 | |
| The % chance that a team wins a series in a given, average number of games. | ||
ANSWER :
By assessing the simulation output for the 2024 NBA Playoffs it appears that despite making the playoffs two consecutive seasons in a row both the New York Knicks and Lakers under performed against my models expectations.
The Knicks had more than a 50% chance of beating the Indiana Pacers to reach the Eastern Conference Finals when the series went to 5 and 7 games but ended up losing in 7 games to the Pacers. This particular example is a case of injuries affecting the outcome of the series. Despite the model predicting the Knicks had a better chance of winning the more games played in the series, 4 starting rotation players combined for only 3 games played due to injury. Without their expected, strongest playoff lineup the Knicks fell surprisingly at home in Game 7 to the Pacers. If given more data, specific metrics targeting starters minutes lost due to injuries not just missed games as well as other player health metrics, this might produce better predictions. given the injury history of a team before and during a playoff series.
The Los Angeles Lakers had greater than 50% chance of beating the Denver Nuggets to reach Round 2 of the playoffs in the West but lost in 5 games. Despite the Denver Nuggets being the previous seasons champion, the model still favoured the Lakers due to the lack in decay in the ratings for team strength since their Finals win in the year of the COVID-19 NBA bubble. The Denver Nuggets were clear favourites in this series and the model does not fully account for this. To fix this limitation in the model, an adjustment to the input parameters to the Glicko-2 rating system model that would reduce the numbers of (lost) games before rating decay begins would would be sufficient.